Broadcasting machine learning data

ABSTRACT

There is provided a processor configured to transfer data to a plurality of processor circuits. The apparatus includes broadcast circuitry that broadcasts first machine learning data to at least a subset of the plurality of processor circuits.

TECHNICAL FIELD

The present disclosure relates to systems, methods and apparatuses for optimizing machine learning processing. In particular, the present disclosure relates to optimizing the transmission of data in a neural network, such as a convolutional neural network.

DESCRIPTION

Neural networks, such as convolutional neural networks (CNNs), typically comprise a number of processing layers which can be broadly characterized as either Input Layers, Computational (Hidden) Layers or Output Layers. Each Computational Layer in the network receives input data, typically in the form of a Feature Map (FM) (sometimes referred to as an Input Feature Map (IFM)), from either an Input Layer or a preceding Computational Layer. Each Computational Layer processes the received input data and outputs processed data, typically in the form of an Output Feature Map (OFM), the output data being passed either to the next Computational Layer, or an Output Layer. Each Computational Layer processes the received input data with a layer-specific kernel (sometimes also referred to as a filter). The kernel may either be a pre-defined operator, or a function created during training of the neural network. The kernel may comprise at least one of a weight, a bias, or an activation function.

Processors are used to execute such neural networks. In many cases specialized processors are used, such as neural processing units (NPUs) and other custom processors specifically adapted for neural network computations. It is also possible to use generalized processors to perform neural network computations, including central processing units (CPUs), graphics processing units (GPUs), digital signal processors (DSPs), etc. These processors may also have on-board storage (internal memory), for example in the form of static random-access memory (SRAM). However, even with modern specialized processors, all but the very simplest of neural networks require significant processing resources (memory and compute resources) to implement, due to the size of the input data being processed and the complexity of the calculations being performed. These significant processing requirements mean that it can take a considerable amount of time to perform a machine learning task.

It is common for the processing to be apportioned to multiple processors, for example between multiple processors in a CPU cluster, multiple shader cores in a GPU, or multiple execution units in an NPU, to increase machine learning processing throughput and to reduce latency. A processing layer may be divided and executed on multiple processors simultaneously. Alternatively, or in addition, the processors may be assigned to process a plurality of processing layers so that the plurality of layers are processed in a pipelined manner.

Neural network processing typically requires a significant amount of data to be fetched, for example the kernels, input data, and data structures and/or programs to perform the processing, from main memory. Neural network processing also generates a significant amount of data. If the processor's on-board storage (storage circuitry and/or processor local-storage circuitry) is not large enough, generated intermediate data will spill to main memory. The large number of memory accesses to both main memory and on-board storage consumes a significant amount of energy, which may exceed the amount of energy consumed by the neural network processing. It is therefore desirable to increase the efficiency of these memory accesses.

Neural network processing is deterministic. The memory accesses needed to perform the processing can be computed and determined in advance. Memory requests can therefore be generated in advance (prefetched) to ensure that the data is available to the processor in time for processing. The amount of computation required to perform the processing is also deterministic and known in advance. Therefore, if a processing layer is divided equally between and executed on a plurality of processors, the plurality of processors will (likely) complete their processing at a similar time. Each processing layer may have different memory and processing requirements; however, because the processing is deterministic, it is possible, for example using a mathematical model, to determine the amount of time it will take to perform the processing of a processing layer on a specific processor. Therefore, it is possible in advance, for example using a driver, to determine how to efficiently schedule a machine learning task to a plurality of processors.

It is therefore desirable to increase the efficiency of the memory accesses across the layers of a neural network to the plurality of processors, to minimize energy consumption and memory bandwidth.

SUMMARY

Viewed from a first example configuration, there is provided a processor comprising: a plurality of processor circuits to perform a machine learning process; and broadcast circuitry configured to broadcast data to at least a subset of the plurality of processor circuits; wherein the processor is configured to obtain at least a first subset of machine learning data from memory to storage circuitry; and broadcast the at least a first subset of machine learning data from storage circuitry to at least a subset of the plurality of processor circuits.

Viewed from a second example configuration, there is provided a data processing method of transferring data to a plurality of processor circuits, the data processing method comprising: fetching at least a first subset of machine learning data from memory to storage; and broadcasting the at least a first subset of machine learning data from storage to at least a subset of the plurality of processor circuits.

Viewed from a third example configuration, there is provided a non-transitory computer-readable medium to store computer-readable code for fabrication of an apparatus comprising: a plurality of processor circuits to perform a machine learning process; and broadcast circuitry configured to broadcast data to at least a subset of the plurality of processor circuits; wherein the processor is configured to: obtain at least a first subset of machine learning data from memory to storage circuitry; and broadcast the at least a first subset of machine learning data from storage circuitry to at least a subset of the plurality of processor circuits.

BRIEF DESCRIPTION OF THE DRAWINGS

The present technique will be described further, by way of example only, with reference to embodiments thereof as illustrated in the accompanying drawings, in which:

FIG. 1 shows an example data processing apparatus;

FIG. 2 shows an exemplary processing system;

FIG. 3 shows schematically the compiling of a shader program for execution by a graphics processing pipeline;

FIG. 4 schematically shows an apparatus;

FIG. 5 illustrates an example of the level two cache;

FIG. 6 illustrates a flowchart that shows an alternative process to reduce the extent to which unnecessary bandwidth consumption occurs;

FIGS. 7A and 7B collectively show how the operational mode (broadcast kernel mode vs feature map broadcast mode) can be dynamically changed at runtime;

FIG. 8 illustrates, in the form of a flowchart, an example of the work dispatch process.

Like reference numerals are used for like components where appropriate in the drawings.

DESCRIPTION OF EXAMPLE EMBODIMENTS

Before discussing the embodiments with reference to the accompanying figures, the following description of embodiments and associated advantages is provided.

In accordance with one example configuration there is provided a processor comprising: a plurality of processor circuits to perform a machine learning process; and broadcast circuitry configured to broadcast the data to at least a subset of the plurality of processor circuits; wherein the processor is configured to: obtain at least a first subset of machine learning data from memory to storage circuitry; and broadcast the at least a first subset of machine learning data from storage circuitry to at least a subset of the plurality of processor circuits.

In the above examples, one item of machine learning data is sent to a plurality of processor circuits (at least a subset of all the processor circuits, which may include but need not be all of them). Each of the processor circuits that has been sent the first machine learning data also obtains second machine learning data. This is expected to be different for each of the processor circuits in question. Each processor circuit then performs processing using the first machine learning data and the second machine learning data. Since the first machine learning data can be broadcast to the (at least a subset of) processing circuits, there is a reduction in resource consumption as opposed to a situation in which each processing circuit individually fetches the first machine learning data and the second machine learning data. Broadcast is defined to mean broadcast or multicast in this specification.

In some examples, the first machine learning data is a kernel and the second machine learning data is a feature map, or the first machine learning data is the feature map and the second machine learning data is the kernel.

In some examples, the first machine learning data is a program and/or data structure to control the at least a subset of the processor circuits, and the second machine learning data is a feature map or kernel.

In some examples, the first machine learning data is a program and/or data structure to control the at least a subset of the processor circuits and the second machine learning data is a feature map, or the first machine learning data is the program and/or data structure to control the at least a subset of the processor circuits and the feature map and the second machine learning data is the kernel.

Machine learning data may be at least one of a feature map, a kernel or a program or data structure to control the at least a subset of the processor circuits.

In some examples, the storage circuitry is a cache. For instance, the cache might be a level three cache or a last level cache, or a level two cache. The cache may form part of a memory hierarchy together with at least a level one cache and a main memory backed, e.g. by DRAM.

In some examples, the apparatus comprises: the processor circuits, wherein the processor circuits are configured to store processed data, generated as a result of processing the second machine learning data with the first machine learning data, back to the storage circuitry. Having processed the first machine learning data with the second machine learning data, a result is produced. This result might initially be stored within local circuitry of the processor circuit that generated it (particularly if further processing is to be performed by that processor circuit), but ultimately is sent to the storage circuitry. By sending the data back to the storage circuitry (rather than directly to, for instance, another processor circuit), issues of coherency can be greatly simplified—the latest version of data is either stored in a specific, known processor circuit, or it is stored in the storage circuitry.

In some examples, the apparatus is configured to operate in a kernel broadcast mode in which the first machine learning data is a kernel and the second machine learning data is a feature map; and the apparatus is configured to operate in a map broadcast mode in which the first machine learning data is the feature map and the second machine learning data is the kernel. In these examples, the apparatus is not limited to an either/or situation and can instead change between broadcasting the kernel and broadcasting the feature map.

In some examples, the apparatus is configured to dynamically change between the map broadcast mode and the kernel broadcast mode. The change between the map broadcast mode and the kernel broadcast mode can therefore happen, e.g. at runtime, and on demand.

In some examples, the apparatus is configured to dynamically change between the map broadcast mode and the kernel broadcast mode in dependence on a layer of neural network to which the kernel and the feature map relate. A neural network may be made up of a number of different layers. For instance, at a first layer, a first kernel may be applied to an input feature map to generate an output feature map, which becomes an intermediate feature map to a second layer that applies a second kernel and generates a further output feature map that becomes the intermediate feature map to a third layer and so on. As the layers are applied, the size of the input/intermediate feature maps may grow or shrink. In addition, the number of kernels applied in each layer (e.g. to an input/intermediate feature map), and the number of channels for each kernel might also change as the layers are applied. By broadcasting the largest of these (the kernel data or the feature map data) a greater saving of bandwidth and energy consumption can be made, as opposed to a situation where a larger number of transactions must occur. In some examples, a further consideration is whether and which of the kernel and feature map will fit into an on-board storage of the processor circuits. In particular, prior to considering which of the feature map and the kernel is larger, the process might firstly rule out whichever of these is too large for the on-board storage of the processor circuits. By storing the largest of the kernel and feature map that will fit into the on-board storage of the processor circuits, it is possible to reduce the number of external memory accesses and therefore improve performance and energy consumption.
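The per-layer choice described above can be sketched as follows. This is a minimal illustration only: the names `BroadcastMode` and `select_broadcast_mode`, the byte-size parameters, and the tie-breaking rule (the kernel is preferred when both sizes are equal) are assumptions made for the example and are not taken from the disclosure.

```python
from enum import Enum
from typing import Optional


class BroadcastMode(Enum):
    KERNEL = "kernel broadcast mode"
    MAP = "map broadcast mode"


def select_broadcast_mode(kernel_bytes: int,
                          feature_map_bytes: int,
                          local_storage_bytes: int) -> Optional[BroadcastMode]:
    """Pick which item to broadcast for one layer (hypothetical helper).

    First rule out whichever item is too large for a processor circuit's
    on-board storage, then broadcast the larger of the items that remain.
    Returns None when neither item fits.
    """
    candidates = []
    if kernel_bytes <= local_storage_bytes:
        candidates.append((kernel_bytes, BroadcastMode.KERNEL))
    if feature_map_bytes <= local_storage_bytes:
        candidates.append((feature_map_bytes, BroadcastMode.MAP))
    if not candidates:
        return None
    # max() keeps the first candidate on a tie, so the kernel wins ties
    # (an arbitrary assumption of this sketch).
    return max(candidates, key=lambda c: c[0])[1]
```

Calling the helper once per layer would allow the mode to change as the feature map and kernel sizes grow or shrink through the network.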

In some examples, the broadcast circuitry is configured to broadcast the first machine learning data to at most a subset of the plurality of processor circuits; the broadcast circuitry is configured to broadcast third machine learning data, different to the first machine learning data, to a further subset of the plurality of processor circuits; the subset of the plurality of processor circuits and the further subset of the plurality of processor circuits are mutually exclusive; and the first machine learning data and the third machine learning data relate to different layers of a neural network. In these examples, different bits of machine learning data are broadcast to different subsets of the processor circuits—with each processor circuit also acquiring its own second item of machine learning data. This makes it possible, for instance, for different processor circuits to operate on different layers of the neural network simultaneously. In the case of convolution layers, deconvolution layers, and recurrent layers, all of which use spatially local processing, it is possible to pass generated results in one layer directly to the next layer as a ‘pipeline’ of processing where each processor circuit does processing for one of several layers and each of the processor circuits operate in parallel.

In some examples, the processor circuits are CPU cores. In some examples, a plurality of CPU cores shares a storage circuitry, such as a level two cache. Each core may include its own local-storage circuitry (e.g. a level one cache). The machine learning data can be stored within a main memory and fetched into the storage circuitry (e.g. level two cache). The data in the storage circuitry may then be sent to and stored within the processor local-storage circuitry, operated on by the CPU core, and then sent back to the storage circuitry, e.g. with the results of other CPUs.

In some examples, the processor circuits are shader cores. Shader cores are processor circuits that are typically found in a Graphics Processing Unit (GPU). They are often capable of executing arbitrary programs (shader programs) on inputs. The circuitry of such units, being intended for graphics processing, is particularly well suited to performing large numbers of floating-point and fixed-point mathematical operations in parallel. It is therefore also well suited to performing computation tasks, such as machine learning (which also involves a large number of mathematical operations being performed). Graphics Processing Units (GPUs) are therefore often used to process machine learning training and inference tasks.

In addition, a Graphics Processing Unit (GPU) may use machine learning in rendering, for example to perform the rendering task more efficiently and/or with higher quality by using machine learning in some of the processing stages. The technology described herein may therefore be particularly applicable to a Graphics Processing Unit (GPU).

In some examples, the rendering process is a rasterization process. In rasterization, the rasteriser receives graphics primitives for rendering, rasterises the primitives to sampling points and generates graphics fragments having appropriate positions (representing appropriate sampling positions) for rendering the primitives. The fragments generated by the rasteriser are then sent onwards to the rest of the pipeline for processing. Machine learning super sampling, machine learning anti-aliasing, denoising and super resolution are examples of machine learning techniques that can be used to perform rasterization more efficiently.

In some examples, the rendering process is a ray-tracing, or path-tracing process. Ray-tracing is a technique used for light rendering (particularly in 3D rendering), which allows for more realistic rendering by modelling the light rays themselves. Since the only rays of light that are relevant in such a frame are those that strike the (virtual) camera, one can trace the rays of light backwards from the camera to the source(s). This process is very expensive. The more rays of light that are traced the more realistic the scene is likely to be (at least up to a point). However, each additional ray of light that is traced increases the complexity of the process. One technique is therefore to reduce the number of traced rays (e.g. preferably to one per pixel) and then to compensate for a small number of rays being traced by applying machine learning techniques. However, in tracing a smaller number of rays (which thereby reduces the computation requirement) the result is likely to be ‘noisy’. It is therefore possible to perform a denoising process using machine learning in order to clean up the ray-traced image.

In some examples, the rendering process is a hybrid ray-tracing process. Hybrid ray-tracing is a technique that uses both rasterization and ray-tracing to render an image. As hybrid ray-tracing makes use of rasterization and ray-tracing, at least one of the aforementioned techniques (machine learning super sampling, machine learning anti-aliasing, machine learning super resolution and machine learning denoising) may be used by this rendering process.

A Graphics Processing Unit (GPU) may use immediate mode rendering, where vertex and fragment processing are performed in sequence for each primitive. Alternatively, a tile-based rendering approach, performed by a tile-based graphics processor, may be used. In tile-based rendering, rather than the entire render output, e.g., frame, effectively being processed in one go as in immediate mode rendering, the render output, e.g., frame to be displayed, is divided into a plurality of smaller sub-regions, usually referred to as “tiles”. Each tile (sub-region) is rendered separately (typically one-after-another), and the rendered tiles (sub-regions) are then recombined to provide the complete render output, e.g., frame for display. In such arrangements, the render output is typically divided into regularly-sized and shaped sub-regions (tiles) (which are usually, e.g., squares or rectangles), but this is not essential.

The render output for a tile (sub-region) is stored in a tile buffer (local-storage circuitry). The tile buffer is provided as part of RAM that is located on (local to) the graphics processor shader core. Once the render output for a tile is complete, the contents of the tile buffer are typically written to a frame buffer in main memory.

The render output data array may typically be an output frame intended for display on a display device, such as a screen or printer, but may also, for example, comprise intermediate data intended for use in later rendering passes (also known as a “render to texture” output), etc. In a render-to-texture process, the result can then be written to a buffer in main memory. The render-to-texture buffer may then be used as an input in a further rendering process.

The processor, for example a CPU or a Graphics Processing Unit (GPU), typically has a multi-level memory system, with each processor circuit (shader core) containing local-storage circuitry. In a tile-based graphics processor, the local-storage circuitry comprises at least tile buffers. The next level of the memory system, the storage circuitry, is typically a level two cache. The level two cache may be shared between the plurality of processor circuits.

In a Graphics Processing Unit (GPU), tile buffers are used to store the fragment data while it is operated on by the shader core. A shader core may store fragment data relating to a tile of rendered data in the tile buffer. The fragment data generated by the shader core in a rendering process and stored in the tile buffer may be image data. However, for example in a deferred shading rendering process, the fragment data generated by the shader core may comprise at least one of image data, positions, normals, materials and depth data. This information may be stored in a buffer, known as a G-buffer (geometry buffer). A portion of this fragment data, the fragment data associated with a tile, may be stored in a tile buffer. The tile buffers are therefore suitable for holding machine learning data (e.g. feature maps and/or kernels) and for storing the results. In some of these examples, the machine learning data may be split across a plurality of shader cores found within a single shader core.

In some examples, the broadcast circuitry is configured to broadcast the first machine learning data to tile buffers in the at least a subset of the plurality of processor circuits.

The processor, for example a CPU or a Graphics Processing Unit (GPU), may execute tasks that may or may not require memory coherency between the processor circuits. To facilitate efficient coherent memory access, the storage circuitry (e.g. level two cache) may contain a snoop filter. A snoop filter maintains a directory, for each entry of the cache, that monitors all coherent traffic in order to keep track of the coherency state of the entry. For non-coherent entries, the snoop filter entry is not used.

Machine learning tasks are deterministic and are therefore likely to use non-coherent memory accesses. In the technology described herein, to improve the efficiency of machine learning memory access, the memory system supports broadcast transactions. For non-coherent entries, the snoop filter entry may be used to store broadcast information for that entry.

In some examples, the apparatus comprises: the storage circuitry, wherein the storage circuitry is configured to store, in association with each entry, an indication of whether that entry is to be broadcast by the broadcast circuitry. Each entry in the storage circuitry can be marked in order to indicate whether it should be broadcast to a plurality of the processor circuits or not. This can be used to differentiate it from other data in the storage circuitry that should be provided (or obtained) singly. It is therefore possible for the first machine learning data marked in this manner to be proactively sent to the (at least a subset) of processor circuits, without each processor circuit having to individually request the first machine learning data and thereby send it out as part of a broadcast.

Machine learning data contains many similar and repeated values and is therefore highly compressible. It is therefore common for the machine learning data to be stored in main memory in a compressed form. To minimise processing and energy consumption, in an example, broadcast data is read from main memory in compressed form, decompressed and stored in storage circuitry (for example a cache), before being broadcast in uncompressed form to the plurality of processor circuits (for example shader cores). In an example, non-broadcast data is read from main memory in compressed form and stored in storage circuitry (for example a cache); the non-broadcast data is then transferred from the storage circuitry to a processor circuit (for example a shader core) in compressed form and decompressed in the processor circuit.

In some examples, the first machine learning data is stored in an uncompressed form in the storage circuitry. In these examples, there is no need for further decompression to be performed on the first machine learning data once it reaches the processor circuits. Consequently, energy consumption is reduced, since a large number of processor circuits need not each perform decompression on the same (first machine learning) data.

In some examples, the apparatus comprises: decompress circuitry configured to decompress the first machine learning data as it is stored into the storage circuitry, wherein the storage circuitry is a cache; and the fetch circuitry is configured to fetch or stream the first machine learning data in a compressed form from a main memory. The decompress circuitry can be used in order to decompress the first machine learning data when it is obtained from the main memory. The decompressed data can then be stored in the cache, which is fetched/streamed and then broadcast to the processor circuits. In this way, it is not necessary for each of the processor circuits to perform decompression on the same first machine learning data. Instead, it can be performed once before being broadcast.
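The difference between the two paths can be modelled in software as below. This is a sketch under stated assumptions: `zlib` stands in for whatever compression scheme the memory system actually uses (the disclosure does not name one), the function names are hypothetical, and a Python list of copies stands in for delivery to each processor circuit's local storage.

```python
import zlib


def broadcast_decompressed(compressed: bytes, num_circuits: int):
    """Decompress once on the way into shared storage, then broadcast.

    Returns (copies delivered to the processor circuits, decompress count).
    """
    cached = zlib.decompress(compressed)  # single decompression
    local_copies = [cached for _ in range(num_circuits)]  # modelled broadcast
    return local_copies, 1


def per_circuit_decompress(compressed: bytes, num_circuits: int):
    """Non-broadcast path: every circuit receives compressed data and
    performs its own decompression."""
    local_copies = [zlib.decompress(compressed) for _ in range(num_circuits)]
    return local_copies, num_circuits
```

Both functions deliver identical data; only the number of decompression operations differs, which is the saving the passage above describes.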

In some examples, the apparatus comprises: comparison circuitry configured to compare a hash of the feature map with a hash of the processed data and to store the processed data back to the storage circuitry in dependence on the hash of the feature map and the hash of the processed data differing. If the application of the first machine learning data to the second machine learning data does not provide new data (i.e. if the feature map is not changed as a result of the application of the kernel) then the act of writing the unchanged feature map back to the storage circuitry can be inhibited, thereby reducing bandwidth and energy consumption. In practice, this should have no effect on any computation, since the result remains the same. Furthermore, where writes involve compression, there is no need for the compression to be performed since the data remains unchanged. Note that in these examples, it may be necessary to assert the necessary signals to state that the write has been performed (even though it hasn't) in order to indicate that the processing has been performed.
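A software analogue of the comparison circuitry might look like the following. It is illustrative only: the choice of SHA-256, the dictionary used as the storage circuitry, and the function name `write_back_if_changed` are assumptions of this sketch rather than details of the disclosure.

```python
import hashlib


def write_back_if_changed(storage: dict, key: str,
                          feature_map: bytes, processed: bytes) -> bool:
    """Store `processed` only if its hash differs from the feature map's hash.

    Returns True if a write was performed and False if it was inhibited.
    """
    fm_hash = hashlib.sha256(feature_map).digest()
    out_hash = hashlib.sha256(processed).digest()
    if fm_hash == out_hash:
        # Unchanged result: skip the write. Hardware may still need to
        # signal completion so that the processing is seen as done.
        return False
    storage[key] = processed
    return True
```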

Machine learning processing tasks are typically written in a machine learning framework, such as TensorFlow, Caffe or ONNX. There may be a software layer, a machine learning platform library (e.g. Arm NN), between the machine learning framework and the targeted processor, which enables efficient translation of the supplied task. The target processor, such as a Graphics Processing Unit (GPU), may have a driver executing on a host processor (e.g. CPU) which generates programs (shader programs), data structures and commands for executing the task on the Graphics Processing Unit (GPU).

The driver may analyse the supplied machine learning task, (e.g. number, size and complexity of each layer of the neural network), machine learning task requirements, (e.g. latency, throughput, processing resources available), type, number and capability of the processors, and size, capability (e.g. compression) and structure of the memory system. The driver will then generate data structures, programs (shader programs) and commands for execution by the Graphics Processing Unit (GPU). The commands, data structures, and programs (shader programs) generated by the driver, may specify how the machine learning task should be broken down and executed on the plurality of processors, to meet the machine learning task requirements and to maximise efficiency.

For example, where the machine learning task comprises a large, a medium and a small layer, the driver may divide the large and medium layers into chunks, each chunk being executed substantially simultaneously on a different processor (shader core). The driver may allocate the small layer to a single processor. Therefore, in an example, a single layer, where the processing has been divided into chunks, may be executed substantially simultaneously by a plurality of processors (shader cores). If the layer is divided equally between the processors (shader cores), as ML processing is deterministic, each processor should complete its processing at roughly the same time. In an example, multiple layers may be processed substantially simultaneously by a plurality of processors. A first layer is processed by at least one first processor, and a second layer which uses the output of the first layer is processed by at least one second processor, where the first and second processors are different. As the at least one first processor processes the first layer, it generates and outputs data (e.g. Output Feature Map (OFM)). The at least one second processor reads the output from the at least one first processor (as it is generated) and performs the second layer processing on the data (Feature Map). The first and second layers thus execute on the at least one first and second processors in a pipelined manner. The described processor scheduling and processor affinity information is provided in at least one of the commands, data structures and programs generated by the driver. The scheduling and affinity information may assume that the machine learning task is executed exclusively on the Graphics Processing Unit (GPU). However, the GPU may execute multiple tasks simultaneously.
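One simple way a driver could divide a layer into near-equal chunks is sketched below. The row-wise split and the name `split_layer_rows` are assumptions made for illustration; the disclosure does not specify how chunks are formed.

```python
def split_layer_rows(num_rows: int, num_cores: int):
    """Split rows [0, num_rows) into near-equal contiguous chunks, one per core.

    Returns a list of (start, end) half-open ranges. When the rows do not
    divide evenly, the first few cores each take one extra row.
    """
    base, extra = divmod(num_rows, num_cores)
    chunks = []
    start = 0
    for core in range(num_cores):
        size = base + (1 if core < extra else 0)
        chunks.append((start, start + size))
        start += size
    return chunks
```

Because chunk sizes differ by at most one row, cores running the same deterministic work on their chunk would be expected to finish at roughly the same time.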

A job manager in the Graphics Processing Unit (GPU) reads the commands, data structures and programs generated by the driver, and determines the (processing) resources that are available. The job manager then schedules the machine learning task to the processors. Therefore, in an example, if a machine learning task is to be executed whilst a graphics processing task is executing, there will be fewer resources available to the machine learning task and the job manager will modify the scheduling as specified by the driver commands, data structure and programs appropriately.

Where a layer is divided into chunks and executed on a plurality of processors, a portion of the data sent to each of the plurality of processors may be unique to that processor, and a portion of the data sent to the plurality of processors will be common or similar for the plurality of processors.

The driver may indicate in the generated commands, data structures and programs which data is common for a plurality of processors and should preferentially be broadcast to the plurality of processors. The job manager in the graphics processor reads the commands, data structures and programs generated by the driver, and may modify the scheduling, data to be broadcast, and processing in dependence on the available resources. The job manager then dispatches the tasks to the plurality of processors.

In an example, storage circuitry or a processor contains a prefetch engine or state machine which fetches the common data. Preferably, just one of the plurality of processors issues memory requests for the common data to the (shared) storage (e.g. level two cache). A broadcast indicator in the memory access indicates at least one of a broadcast flag, a broadcast destination (the processors the data is to be broadcast to), and a broadcast address (where the broadcast data is to be written in the processors' local storage). This broadcast indicator may be written to the entry in the (shared) storage circuitry. When the data is available in the (shared) storage circuitry, the broadcast indicator is interrogated for the entry. If the entry is marked for broadcast, the broadcast destination is determined and the entry is sent to the specified plurality of processors, to be stored in the plurality of processors' local storage at the address specified by the broadcast address.
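The handling of a broadcast indicator can be modelled as follows. This is a minimal software sketch: the class names `BroadcastIndicator` and `CacheEntry`, the function `on_fill`, and the nested dictionaries used for each processor's local storage are all hypothetical and chosen only to mirror the three fields named above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class BroadcastIndicator:
    flag: bool               # broadcast flag
    destinations: List[int]  # broadcast destination: processor IDs
    address: int             # broadcast address in each local storage


@dataclass
class CacheEntry:
    data: Optional[bytes] = None
    indicator: Optional[BroadcastIndicator] = None


def on_fill(entry: CacheEntry, data: bytes,
            local_storage: Dict[int, Dict[int, bytes]]) -> List[int]:
    """Called when data becomes available in the shared storage entry.

    If the entry is marked for broadcast, the data is written to every
    destination processor's local storage at the broadcast address.
    Returns the list of processors that received the data.
    """
    entry.data = data
    ind = entry.indicator
    if ind is None or not ind.flag:
        return []  # ordinary entry: nothing is pushed out
    for proc in ind.destinations:
        local_storage[proc][ind.address] = data
    return list(ind.destinations)
```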

In an example, the program executed on a processor performs memory requests to fetch the common data. Where the program executed on the plurality of processors is the same, the job manager may indicate that only one of the plurality of processors executes the instructions to fetch the common data.

In an example, there may be a program to fetch the common data, unique data and perform processing, that is executed on one processor of a plurality of processors, whilst the other of the plurality of processors execute a different program, which fetches the unique data, and performs processing.

In an example, a region of memory may be specified to be broadcast. This information may be stored in a data structure (e.g. page tables) used by a memory management unit (MMU). When a memory transaction is generated, the page table entry for that memory region is interrogated. If the page table entry indicates that the memory transaction is to a broadcast region of memory, a broadcast memory access is performed.
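As a hedged illustration of the page table check, the following sketch classifies a memory transaction as a broadcast or an ordinary (unicast) access. The page size of 4096 bytes, the single-level dictionary used as a page table, and the names `PageTableEntry` and `classify_access` are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

PAGE_SIZE = 4096  # assumed page size, for illustration only


@dataclass(frozen=True)
class PageTableEntry:
    physical_page: int
    broadcast: bool = False  # set when the region is specified to be broadcast


def classify_access(page_table: Dict[int, PageTableEntry],
                    virtual_address: int) -> Tuple[str, int]:
    """Interrogate the page table entry and return (access kind, physical address)."""
    entry = page_table[virtual_address // PAGE_SIZE]
    physical = entry.physical_page * PAGE_SIZE + virtual_address % PAGE_SIZE
    return ("broadcast" if entry.broadcast else "unicast", physical)
```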

The GPU job manager provides tasks to each processor (shader core). The job manager will not issue a new task to a processor (shader core) until it has completed its current task. In an example, if the machine learning layer processing for a layer is divided between a plurality of processors (shader cores), the job manager may ensure that all of the plurality of processors (shader cores) have finished processing their region before the next layer processing is started. The job manager may therefore stop issuing new tasks to the plurality of processors (shader cores) until the plurality of processors (shader cores) have all completed their current task.

It will further be appreciated that not all steps of the methods of the technology described herein need be carried out by computer software and thus from a further broad embodiment the technology described herein provides computer software and such software installed on a computer software carrier for carrying out at least one of the steps of the methods set out herein.

Particular embodiments will now be described with reference to the figures.

FIG. 1 illustrates an example data processing apparatus 100. The example shows a single shader core 110 (although several such devices might be present) in communication with bus 105 (e.g. wire). Each shader core 110 contains an execution engine 130 and one or more tile buffers 120. The tile buffers are generally responsible for holding data relating to a tile (a contiguous section of an image). For instance, each tile might hold 16×16 pixels, 32×32 pixels, 64×64 pixels, 128×128 pixels, or more generally, 2^(n)×2^(m) pixels (where n and m are positive integers; in an example, n and m are equal). The execution engine 130 is able to execute a number of “programs” or “shader programs”. As well as being able to perform general purpose computation, the execution engine 130 may contain a number of different specialised units. In this example, the shader core 110 is shown to contain a rasterizer 140, which takes vertex data and converts the vertex data into fragment data (i.e. data that can be stored in the tile buffers 120). A machine learning (ML) execution unit 137 specialises in (e.g. accelerates) machine learning operations (and therefore, for instance, might be well suited to perform matrix operations). In some examples, the machine learning operations are performed by a shader program that is processed by the execution engine 130. In some cases, machine learning operations may be simultaneously performed in the execution engine 130 and the ML execution unit 137. Machine learning processing might only start if there is a free tile buffer 120 into which the feature map and kernels can be written.

The shader core 110 also includes a compare circuit 195. This circuit is used to determine whether a write back operation to a cache 180, 187, 190 is necessitated or not, as will be discussed with reference to FIG. 6.

Also connected to the bus 105 is a tiler 150. The tiler 150 generates a tile list that indicates which primitives should be rendered for which tiles. Meanwhile, the job manager 135 generates fragment processing jobs for each tile, using the tile lists.

Collectively, the job manager 135, tiler 150, and shader core(s) 110 can make up a graphics processing unit (GPU) 175, although it will be appreciated that the GPU may comprise additional or alternative components/features.

Parts of a memory hierarchy are also connected to the bus 105. For instance, this may include a main memory 160 (e.g. backed by DRAM) and one or more caches 180, 187, 190, which in this case include a level one cache 180, a level two cache 187, and a level three cache 190 that acts as a last level cache (LLC). After processing is performed by the shader cores, the processed data is written to the main memory 160. Note that the main memory 160 may be “off-chip” as compared to the other components and, typically, accesses to or communications with the main memory 160 are slower than accesses to or communications with other components.

The inventors of the present technique have realised that it is helpful for the execution engine 130 to be capable of performing ‘complex rendering’ in which a traditional rendering process (e.g. rasterization) is coupled with a machine learning process. To this end, a machine learning execution unit 137 is also provided within the shader core 110 (although machine learning can also be performed through other specialised units within the execution engine 130 or even through generic circuitry in the execution engine 130 itself, e.g. using software). Furthermore, the tile buffers 120 are used during the rendering process to perform traditional rendering while during the machine learning process, the tile buffers are used to store data relating to machine learning on the tile such as the weights associated with kernels or feature maps. By keeping the data for a tile local, there is no need for the data to be transported to and from the shader core 110 and part of the memory hierarchy 160, 180, 190.

Also connected to the bus 105, in this example, is a central processing unit (CPU) 115. In this example, the CPU 115 hosts the driver 125, which generates data structures and programs that are used by a GPU job manager 135, which in turn dispatches work to the shader cores 110, as will be described in more detail below with reference to, for instance, FIG. 8.

FIG. 2 shows an exemplary processing system. An application 2, such as a game or machine learning task, executing on a host processor 1 will require operations to be performed by an associated graphics processing unit (GPU) (graphics processor) 3. To do this, the application 2 executing on the host processor 1, will generate API (Application Programming Interface) calls that are interpreted by a driver 4, which is also executing on the host processor 1. The driver generates commands for the graphics processor 3, to generate output required by the application 2. To facilitate this, a set of “commands” will be provided to the graphics processor 3 in response to commands from the application 2 running on the host system 1 for output.

FIG. 3 shows the shader program being provided in the high-level shader programming language 301 by the application 2 to the driver 4, which then compiles 302 the shader program to the binary code 303 for the graphics processing pipeline. Each shader core is a processing stage that performs graphics processing by running small programs (shader programs) for each “work” item in an output to be generated. For each work item to be processed, an execution thread that will execute the corresponding shader program is issued to appropriate programmable processing circuit(s) that then execute the shader program for the execution thread in question.

FIG. 4 shows an apparatus 800. The apparatus 800 might be the same apparatus as illustrated with respect to FIG. 1 and elements with the same name perform substantially the same function as already described. However, FIG. 4 is provided to focus on a different aspect that can be achieved with the same (or similar) system. The apparatus includes a Graphics Processing Unit (GPU) 875. The GPU 875 includes a number of shader cores (SC1, SC2, . . . SCN) 810 a, 810 b, 810 c. Each shader core includes an execution engine 830 a, 830 b, 830 c, which is responsible for executing small programs (shader programs) on input data, together with a set of tile buffers 820 a, 820 b, 820 c. In graphics processing tasks, the tile buffers are responsible for holding data relating to a tile, which is a contiguous 2D area of a frame, and which is to be processed by the shader program running on the execution engine 830 a, 830 b, 830 c of a shader core. The tile buffers might have a storage capacity of, for instance, 1 kB and four tile buffers 820 a might be provided for each core 810 a for a total storage of 4 kB per shader core 810 a.

A tiler 850 is provided in order to separate screen space primitives into a number of tile regions. These are then sent to the shader cores 810 a, 810 b, 810 c by the job manager 835.

The present examples provide an efficient way of enabling machine learning data to be processed using the shader cores 810 a, 810 b, 810 c. Since the shader cores are well suited to performing large amounts of parallelised mathematical processing for graphics tasks, they are also well suited to machine learning operations (which typically involve matrix operations).

The job manager 835 determines a next machine learning operation to be performed and determines which data is required for the operation to be performed. Fetch circuitry 815 is provided to obtain machine learning data from a main memory 860, which might be backed by a DRAM. During this process, the data from the main memory 860 might need to be decompressed by compress/decompress circuitry 825. The data can then be stored into local storage circuitry (e.g. a level two cache 887) in decompressed form.

The job manager then selects a set of the shader cores 810 a, 810 b, 810 c with which to perform the processing for the machine learning operation. Machine learning operations typically comprise a number of layers in which a feature map is continually operated upon. At each layer a kernel, which is a part of a model that has been trained for performing the operation, is applied to an input or intermediate feature map in order to produce an output feature map. The output feature map of one layer then becomes an input feature map to the next layer. The kernel at each layer is unmodified by the processing operation. The machine learning operation performed by the execution engine 830 a, 830 b, 830 c therefore requires a kernel and a feature map to operate. Typically, however, one of these items of machine learning data (the kernel or the feature map) will be used by all of the selected shader cores while the other item of machine learning data (the feature map or the kernel) will be specific to each shader core. The fetch circuitry 815 is used to fetch the machine learning data from the level two cache 887. Broadcast circuitry 855 is then used to broadcast the machine learning data that is common to the selected shader cores 810 a, 810 b, 810 c while the machine learning data that is specific to each individual shader core 810 a, 810 b, 810 c is individually provided using dispatch circuitry 845.

Note that since the data has already been decompressed when it is stored into the level two cache 887, there is no need for the common machine learning data to be repeatedly fetched from main memory 860, nor repeatedly decompressed, nor repeatedly transmitted to the shader cores 810 a, 810 b, 810 c and thus bandwidth and energy consumption are reduced.

In some cases, a plurality of different machine learning operations might be performed (e.g. operating on different layers of a neural network) in which case one subset of the shader cores 810 a, 810 b might operate on one set of data and a second subset of the shader cores 810 c might operate on another set of data. In these cases, multiple broadcasts might be made by the broadcast circuitry 855, each to a different subset of the shader cores 810 a, 810 b, 810 c.

Each of the shader cores 810 a, 810 b, 810 c also contains compare circuitry 895 a, 895 b, 895 c. These circuits are used to determine whether a write back operation to the level 2 cache 887 is necessitated or not, as will be discussed with reference to FIG. 6.

FIG. 5 illustrates an example of the level two cache 887. The structure of the level two cache 887 includes a tag field 910, which is used to index individual entries of the cache structure so that they can be located, and a data field 920, which contains the actual data of interest. For each entry, a validity flag 930, a dirty flag 940, and a broadcast flag 950 are provided. The validity flag 930 is used to indicate whether a particular entry is valid and should therefore be used or not. The dirty flag 940 is used to indicate whether the entry has been modified since being acquired from main memory 860 (where the ‘official’ copy of the data resides). Finally, the broadcast flag 950 is used to indicate whether the data should be broadcast by the broadcast circuitry 855 when it is sent out, or unicast by the dispatch circuitry 845 when it is sent out. This is determined by the job manager 835 when the data is fetched from the main memory 860 into the level 2 cache 887, depending on the nature of the machine learning operation to be performed.

Note that in this example, a simple ‘broadcast’ flag is provided, which indicates whether the data in the data field 920 should be broadcast to all of the shader cores 810 a, 810 b, 810 c or not. In other embodiments, the cache may replace the broadcast flag 950 with a broadcast destination to indicate which of the shader cores 810 a, 810 b, 810 c the data in the data field 920 should be broadcast to. The broadcast flag 950 (and/or alternatively the broadcast destination) can also be used to differentiate the broadcastable data from other data that might be present in the level 2 cache 887. In particular, this can be used to inhibit or prevent data that is intended to be broadcast to the processing units from being pulled into instruction caches or other data caches and can enable the data to instead be directed towards the tile buffers 820 a, 820 b, 820 c of the relevant shader cores 810 a, 810 b, 810 c (for instance). In some examples, the broadcast destination field may indicate an index into a table. If the broadcast flag 950 is high, the broadcast destination is interrogated, and an entry into a broadcast table is looked up using the broadcast destination index. The entry in the broadcast table indicates the destination shader cores for the broadcast.
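For illustration only, a cache entry carrying the flags described above, together with a lookup in a broadcast table, might be sketched as follows. The field and function names, the representation of the broadcast table as a list of lists of core indices, and the `requester` parameter used for the unicast case are assumptions made for the example rather than features taken from the figures.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class CacheEntry:
    tag: int
    data: bytes
    valid: bool = True
    dirty: bool = False
    broadcast: bool = False
    broadcast_destination: int = 0  # index into the broadcast table


def destination_cores(entry: CacheEntry,
                      broadcast_table: Sequence[Sequence[int]],
                      requester: int) -> List[int]:
    """Return the shader cores that should receive the entry's data."""
    if not entry.valid:
        return []
    if entry.broadcast:
        # Broadcast flag is high: look up the destination cores in the table.
        return list(broadcast_table[entry.broadcast_destination])
    # Otherwise the data is unicast to the core that asked for it.
    return [requester]
```

Using an index into a table, rather than storing a full list of cores in each entry, keeps the per-entry storage small when the same subsets of shader cores are targeted repeatedly.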

Although not shown, there may be a broadcast address field for each entry. The broadcast address field indicates where in the shader cores' local-storage circuitry the broadcast data is to be written.

Although not shown, the level two cache may integrate a snoop filter. A snoop filter tracks the coherency state of each cache line by maintaining a directory. The snoop filter directory comprises a utilization field that indicates which shader cores have allocated the cache line in their local-storage circuitry. In an example, the utilization field is a bitmask containing one bit per shader core; if the cache line is allocated in a shader core, the associated bit is set.
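The bitmask form of the utilization field can be illustrated with a short sketch. The function names are hypothetical, and shader cores are assumed to be numbered from zero.

```python
from typing import List


def set_allocated(mask: int, core: int) -> int:
    """Mark the cache line as allocated in the given shader core."""
    return mask | (1 << core)


def clear_allocated(mask: int, core: int) -> int:
    """Mark the cache line as no longer allocated in the given shader core."""
    return mask & ~(1 << core)


def cores_holding(mask: int, num_cores: int) -> List[int]:
    """List the shader cores whose bit is set in the utilization field."""
    return [core for core in range(num_cores) if (mask >> core) & 1]
```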

The snoop filter utilization field is only used for coherent cache lines and is unused for non-coherent cache lines. Because machine learning memory accesses are deterministic, the memory accesses will typically be non-coherent. Therefore, in an example, for non-coherent cache lines, at least a portion of the snoop filter entry is used to store at least one of a broadcast flag 950, broadcast destination, broadcast address.

Often in such a system, a feature map may move through a number of layers of a neural network, with each processing circuit (e.g. shader core) continually operating with the same kernel (i.e. performing processing for the same layer). A feature map may therefore move between the processing circuits (e.g. shader cores) as it progresses through the neural network layers. Ordinarily, this would require an element of coherency control so that the data can be tracked through the shader cores. Further to this it is desirable to reduce bandwidth where possible—particularly if the processing has not actually changed the feature map.

FIG. 6 illustrates a flowchart 1000 that shows an alternative process to avoid extensive coherency control measures and to reduce the extent to which unnecessary bandwidth consumption occurs. The process may be performed, for instance, via a single shader core. At a step 1010, the kernel is received into tile buffers 820 a of the shader core 810 a. At a step 1020, the feature map is received into the tile buffers 820 a of the shader core 810 a. It will be appreciated that these two steps could, of course, be inverted. At a step 1030, a hash of the feature map is carried out and temporarily stored. The exact hashing algorithm that is used is not especially important but should be such that even small changes to the input will produce changes to the hash that is generated. At a step 1040, processing is performed using the kernel and the feature map in order to produce, for instance, an output feature map. This output feature map is initially stored within the tile buffers 820 a of the shader core 810 a. At a step 1050, the result (the output feature map) is hashed using the same hash algorithm used in step 1030. The two hashes are then compared to each other at step 1060, e.g. using compare circuitry 895 a, 895 b, 895 c. If the two hashes differ from each other then at step 1070, the result (the output feature map) is transmitted back to the level 2 cache 887 (or to a next level of the memory system). Otherwise, at step 1080, the job manager 835 is signalled that the processing is complete and either that no update needed to be provided or that the level 2 cache 887 (already) contains the latest version of the result.
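The steps of flowchart 1000 might be sketched in software as follows. This is an illustrative model only: SHA-256 from the Python standard library is an assumed stand-in for the unspecified hashing algorithm, and the `process` and `write_back` callables are hypothetical placeholders for the layer computation and the transfer to the level 2 cache.

```python
import hashlib
from typing import Callable


def digest(feature_map: bytes) -> bytes:
    # Assumed hash; the description only requires that small input changes alter it.
    return hashlib.sha256(feature_map).digest()


def process_and_maybe_write_back(kernel: bytes,
                                 feature_map: bytes,
                                 process: Callable[[bytes, bytes], bytes],
                                 write_back: Callable[[bytes], None]) -> bool:
    """Return True if the result was written back, False if the write was skipped."""
    before = digest(feature_map)           # step 1030: hash the input feature map
    result = process(kernel, feature_map)  # step 1040: produce the output feature map
    after = digest(result)                 # step 1050: hash the result
    if before != after:                    # step 1060: compare the two hashes
        write_back(result)                 # step 1070: transmit the result
        return True
    return False                           # step 1080: no update needed
```

A usage example under the same assumptions: passing `lambda k, f: f` as `process` leaves the feature map unchanged, so the hashes match and `write_back` is never called.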

The result of processing performed at one shader core 810 a cannot be transmitted directly to another core 810 b. Instead, the data is transmitted to the level 2 cache 887 and from there, the data can be transmitted to the next shader core 810 b as dictated by the job manager 835. In this way, the latest version of data can be stored in the level 2 cache 887 or can be stored in a single, known, specific shader core 810 a at any particular instance. Versioning of data is not required, nor can race conditions occur.

Separately to this, the comparison of ‘before’ and ‘after’ hashes makes it possible to determine whether the feature map has changed as a result of the processing that has been performed (i.e. whether the input feature map is identical to the output feature map). If the hashes are identical, then it might be expected that no change to the feature map has been made and so there is no need to expend bandwidth or energy in transmitting the same data back to the level 2 cache 887.

FIGS. 7A and 7B collectively show how the operational mode (kernel broadcast mode vs feature map broadcast mode) can be dynamically changed at runtime and how the selection of mode can be used to reduce the bandwidth consumption within the system. In particular, it will be appreciated that parallelism can be achieved either by keeping the kernel constant across all of the shader cores 810 a, 810 b, 810 c and having each of the shader cores 810 a, 810 b, 810 c operate on a different feature map, or by having the feature map remain constant across all of the shader cores 810 a, 810 b, 810 c and applying a different kernel to the feature map at each shader core. So, for example, if letters A, B, C, D represent feature maps and letters w, x, y, z represent kernels and numbers 1, 2, 3, 4 represent processor cores then one assignment of kernels to processor cores would be:

-   1: w
-   2: x
-   3: y
-   4: z

Then, the feature maps A, B, C, D could be broadcast one at a time to each of the four processor cores (1-4).

Alternatively, the feature maps could be assigned to the processor cores:

-   1: A
-   2: B
-   3: C
-   4: D

Then, the kernels w, x, y, z could be broadcast one at a time to each of the four processor cores (1-4).

Of course, in some examples, depending on the sizes, the broadcast might include multiple data elements (e.g. multiple kernels or multiple feature maps), which could be split at the cores and processed one after another.

The question of whether the kernel(s) or the feature map(s) should be broadcast is at least partly dependent on which of the two sets of data is the larger. This might change as layers are successively applied. For instance, consider a neural network consisting of three layers that operate as follows:

|          | Layer 1 | Layer 2 | Layer 3 |
|----------|---------|---------|---------|
| Height   | 224     | 112     | 56      |
| Width    | 224     | 112     | 56      |
| Channels | 3       | 256     | 256     |
| Kernels  | 256     | 256     | 512     |

That is to say that in the first layer, for instance, the height and width of the feature map are 224×224, and 256 kernels are applied to the feature map. Meanwhile, there are three channels, which might represent the number of elements of data (e.g. red, green, and blue values for pixels) or could represent the number of different feature maps to which the present layer is to be applied (which might depend on the number of output feature maps generated by a previous layer). In the second layer, the height and the width of the feature map have decreased to 112×112 (e.g. via pooling from the previous layer). The number of channels has increased to 256, and the number of kernels remains at 256.

The data size for the input/intermediate feature map(s) is dependent on the height and width of the feature maps as well as the number of channels (namely height*width*channels) and therefore differs by each layer as follows:

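|                   | Layer 1 | Layer 2 | Layer 3 |
|-------------------|---------|---------|---------|
| Height            | 224     | 112     | 56      |
| Width             | 224     | 112     | 56      |
| Channels          | 3       | 256     | 256     |
| Total IFM(s) size | 150528  | 3211264 | 802816  |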

Meanwhile, the data size for the kernel(s) is dependent on the height and width of the kernels, as well as the number of channels, and the number of kernels to be applied (namely height*width*channels*kernels) and therefore differs by each layer as follows:

|                      | Layer 1 | Layer 2 | Layer 3 |
|----------------------|---------|---------|---------|
| Height               | 3       | 3       | 3       |
| Width                | 3       | 3       | 3       |
| Channels             | 3       | 256     | 256     |
| Kernels              | 256     | 256     | 512     |
| Total kernel(s) size | 6912    | 589824  | 1179648 |

Note that the kernel height and width remain the same in each layer, but the number of channels and the number of kernels increases across the layers. Thus it can be seen that in the first layer, the IFM data is larger, and similarly in the second layer. But for the third layer, the kernel data is larger and so (other factors notwithstanding, as will be discussed below) it would be preferable to broadcast the larger kernel data to reduce the number of large transmissions being made.
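The sizes in the three-layer example can be reproduced with a short calculation. The sketch below is illustrative only: sizes are counted in data elements, and the dictionary keys and function names are assumptions made for the example.

```python
# Per-layer parameters taken from the three-layer example.
LAYERS = [
    {"fm_height": 224, "fm_width": 224, "channels": 3, "kernels": 256},
    {"fm_height": 112, "fm_width": 112, "channels": 256, "kernels": 256},
    {"fm_height": 56, "fm_width": 56, "channels": 256, "kernels": 512},
]
KERNEL_HEIGHT = 3
KERNEL_WIDTH = 3


def ifm_size(layer: dict) -> int:
    """height * width * channels of the input feature map(s)."""
    return layer["fm_height"] * layer["fm_width"] * layer["channels"]


def kernel_size(layer: dict) -> int:
    """height * width * channels * kernels of the kernel(s)."""
    return KERNEL_HEIGHT * KERNEL_WIDTH * layer["channels"] * layer["kernels"]


def larger_item(layer: dict) -> str:
    """Name whichever of the two data sets is larger for this layer."""
    return "kernel" if kernel_size(layer) > ifm_size(layer) else "feature map"


for number, layer in enumerate(LAYERS, start=1):
    print(number, ifm_size(layer), kernel_size(layer), larger_item(layer))
```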

FIG. 7A shows, in the form of a flowchart 1100, how the mode can be changed. At a step 1105, the kernel to be processed is fetched by the job manager 835 for the next machine learning job to be performed. At a step 1110, the feature map to be processed is also obtained. At a step 1111, it is determined whether the kernel size is bigger than a local-storage size of the processor circuits. If not, then at step 1113, it is determined whether the feature map size is greater than the local-storage size of the processor circuits. If not, then neither the kernel nor the feature map is too large for the local-storage of the processor circuits and so at step 1115, it is determined whether the kernel size is bigger than the feature map size. If so, then the mode is set to kernel broadcast mode at step 1120 and otherwise, the mode is set to feature map broadcast mode at step 1125 in order to reduce the number of storage accesses (e.g. cache).

If, at step 1113, the feature map size is bigger than the local-storage size of the processor circuits, then the process proceeds straight to step 1120. This is because there is less value in broadcasting something that is too large for the local-storage of the processor circuits, since this would necessitate multiple broadcasts.

If, at step 1111, the kernel size is bigger than the local-storage size of the processor circuits, then the process proceeds to step 1112 where the size of the feature map is considered. If the feature map is not larger than the size of the local-storage in the processor circuits then the reverse situation applies and so the mode is set to feature map broadcast mode in step 1125 in order to reduce the number of broadcasts taking place.

Finally, if at step 1112 the feature map size is larger than the local-storage size of the processor circuits then neither the kernel nor the feature map can fit within the local-storage of the processor circuits. In this case, the process proceeds to step 1115 where the largest of these items is broadcast.

This can be explained mathematically. Consider that the feature map is 1024 KB and the kernel is 256 KB and the local-storage of each processor circuit is 256 KB. A first option is to broadcast the kernel into the local-storage and stream the feature map to each of the k processor circuits. The total data transfer (in KB) for this would be 256+1024k. A second option is to broadcast one quarter of the feature map to each of the k processor circuits and to stream the kernel to each of the k processor circuits, and to repeat this process four times (1024/256=4). The total data transfer (in KB) for this would be (256+256k)*4=1024+1024k. It is therefore preferable, in general, to primarily store whichever of the feature map or kernel will actually fit in the internal memory of the processor circuits. If either will fit, then selecting the larger of the two to be stored should reduce the data transmission.
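The two options in this worked example can be checked numerically. The following sketch is illustrative only; the function names are hypothetical, and the second function assumes that the feature map divides evenly into pieces that each fill the local-storage.

```python
def kernel_resident_transfer_kb(kernel_kb: int, feature_map_kb: int, k: int) -> int:
    """First option: broadcast the kernel once, stream the feature map to each of k circuits."""
    return kernel_kb + feature_map_kb * k


def feature_map_resident_transfer_kb(kernel_kb: int, feature_map_kb: int,
                                     local_kb: int, k: int) -> int:
    """Second option: broadcast the feature map piece by piece, streaming the kernel each pass."""
    if feature_map_kb % local_kb != 0:
        raise ValueError("sketch assumes the feature map splits into whole pieces")
    passes = feature_map_kb // local_kb  # 1024 / 256 = 4 in the example
    return passes * (local_kb + kernel_kb * k)


# With the example figures, the second option costs 768 KB more for any k.
for k in (1, 4, 16):
    first = kernel_resident_transfer_kb(256, 1024, k)
    second = feature_map_resident_transfer_kb(256, 1024, 256, k)
    print(k, first, second, second - first)
```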

The mode is therefore set in order to cause the broadcast to occur in respect of whichever of the kernel and the feature map is larger that will fit within the local-storage of the processor circuit. This therefore reduces the amount of data transmitted (e.g. over the bus 805) by causing a single broadcast to occur in respect of the larger data structure while allowing a smaller number of individual transmissions to occur in respect of the smaller data structure.
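The decisions of flowchart 1100 might be expressed compactly as follows. This is a sketch under stated assumptions rather than a definitive implementation: the `Mode` enumeration and the `select_mode` function are hypothetical names, and all three sizes are assumed to be given in the same unit.

```python
from enum import Enum


class Mode(Enum):
    KERNEL_BROADCAST = "kernel broadcast"
    FEATURE_MAP_BROADCAST = "feature map broadcast"


def select_mode(kernel_size: int, feature_map_size: int, local_storage_size: int) -> Mode:
    kernel_fits = kernel_size <= local_storage_size            # step 1111
    feature_map_fits = feature_map_size <= local_storage_size  # steps 1112 and 1113
    if kernel_fits and not feature_map_fits:
        return Mode.KERNEL_BROADCAST       # step 1113 -> step 1120
    if feature_map_fits and not kernel_fits:
        return Mode.FEATURE_MAP_BROADCAST  # step 1112 -> step 1125
    # Both fit, or neither fits: step 1115 compares the two sizes.
    if kernel_size > feature_map_size:
        return Mode.KERNEL_BROADCAST       # step 1120
    return Mode.FEATURE_MAP_BROADCAST      # step 1125
```

With the figures used in the worked example (a 256 KB kernel, a 1024 KB feature map and 256 KB of local-storage), `select_mode(256, 1024, 256)` returns `Mode.KERNEL_BROADCAST`, since only the kernel fits.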

FIG. 7B shows, in the form of another flowchart 1125, how the set mode can be used to dictate what gets transmitted. In particular, at a step 1130, it is determined whether the current mode is ‘kernel broadcast’ (as opposed to ‘feature map broadcast’). If the mode is ‘kernel broadcast’ then at step 1135, the kernel is broadcast to the relevant shader cores 810 a, 810 b, 810 c. Then, a different feature map is distributed to the shader cores 810 a, 810 b, 810 c at step 1140. Otherwise, at a step 1145, the feature map is broadcast to the relevant shader cores 810 a, 810 b, 810 c. Then, at a step 1150, a different kernel is distributed to the shader cores 810 a, 810 b, 810 c.
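As a simple illustration of how the set mode dictates what is transmitted, the following sketch returns the item to be broadcast and the item to be individually distributed to each core. The mode strings, the function name `plan_transfers`, and the representation of kernels and feature maps as opaque labels are assumptions made for the example.

```python
from typing import Dict, Sequence, Tuple


def plan_transfers(mode: str,
                   kernels: Sequence[str],
                   feature_maps: Sequence[str],
                   cores: Sequence[int]) -> Tuple[str, Dict[int, str]]:
    """Return (item to broadcast, {core: item distributed to that core})."""
    if mode == "kernel broadcast":                     # step 1130
        broadcast, unique = kernels[0], feature_maps   # steps 1135 and 1140
    else:
        broadcast, unique = feature_maps[0], kernels   # steps 1145 and 1150
    if len(unique) != len(cores):
        raise ValueError("sketch assumes one distinct item per core")
    return broadcast, dict(zip(cores, unique))
```

For instance, in feature map broadcast mode with kernels w, x, y, z and cores 1 to 4, feature map A is broadcast and each core receives its own kernel.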

The above description makes particular reference to shader cores 810 a, 810 b, 810 c. However, it will be appreciated that the present technique is equally applicable to processor cores making up a CPU. In such embodiments, the job manager 835 may take the form of one of the processor cores themselves. The execution engine 830 a, 830 b, 830 c of each core could constitute a pipeline or single execution unit and the tile buffers could take the form of level one caches (assuming each level one cache is specific to the core).

The work dispatch process for the above examples will now be described with reference to the flowchart of FIG. 8.

The process begins at step 1210. At a step 1220, an underlying application (that may execute on a CPU 115 for instance) executes a process that involves machine learning tasks. Details of the processing to be performed are communicated to a driver 125 using an API, such as OpenCL, OpenGL, DirectX, Metal or Vulkan. At a step 1230, the driver 125 interprets the processing to be performed. The driver 125 determines how to perform the processing on the resources available (e.g. at the GPU 175/875 and/or the CPU 115 itself). At step 1240, the driver 125 generates the necessary data for the process to be performed by, e.g. the GPU 175/875. This includes a job list including a list of tasks to be performed, together with any dependencies between those tasks. This might also include affinity information that indicates whether multiple specific tasks should be started together or whether one shader core should be used to perform one specific task after another specific task. The generated information also indicates whether fetched data (kernels, feature maps, programs (e.g. shader programs), or data structures and so on) is common to all the shader cores 110/810 (i.e. common data) and should be broadcast to all the shader cores 110/810, or is unique to a specific shader core (i.e. unique data) and should be unicast to that specific shader core 110/810. The driver 125 also generates data structures and programs (e.g. “shader programs”) to be executed on the GPU. These data structures, programs, job lists, affinity information and so on are written to, for instance, main memory 160. At a step 1245, the job manager (JM) 135/835 or command stream frontend (CSF) reads from the top of the job list to determine whether there is any job in the job list that has no unmet dependencies. If the dependencies for a job are not met, then at step 1250, the process returns to step 1245.
Otherwise, at step 1255, the job manager 135/835 determines whether there are any shader cores 110/810 that are unallocated and idle. This step loops until such a core becomes available. Then, at step 1256, the job manager considers the affinity information as to whether a particular shader core 110/810 is better suited to the specific task. If so, then at step 1257, that shader core 110/810 is selected if it is unallocated, and the process proceeds to step 1260. Otherwise, at step 1258, any of the unallocated shader cores 110/810 is selected and the process proceeds to step 1260. At step 1260, the selected available shader core 110/810 is messaged with the job to be performed, and the corresponding job is removed from the job list. The selected shader core 110/810 then performs the job. Identified common data is fetched from main memory to storage circuitry (e.g. level two cache), and the broadcast circuitry then broadcasts the data to the local-storage circuitry in a plurality of processing circuits. Identified unique data is fetched from main memory to the local-storage circuitry in one of the processing circuits. In an example, the unique data is fetched into the storage circuitry (e.g. level two cache), and then fetched from the storage circuitry (e.g. level two cache) to the local-storage circuitry in one of the processing circuits. At step 1265, it is determined whether more jobs exist in the job list. If so, the process returns to step 1245. Otherwise, the work dispatch process ends at step 1270.
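The work dispatch loop might be modelled in simplified form as follows. This sketch is illustrative only: the `Job` class and `dispatch` function are hypothetical, affinity is reduced to a single preferred core, and every job is assumed to complete before the next one is selected, so that all cores are always idle at step 1255.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Sequence, Set, Tuple


@dataclass
class Job:
    name: str
    depends_on: Set[str] = field(default_factory=set)
    preferred_core: Optional[int] = None  # simplified affinity information


def dispatch(job_list: Sequence[Job], cores: Sequence[int]) -> List[Tuple[str, int]]:
    """Return the (job name, core) assignments in dispatch order."""
    pending = list(job_list)
    completed: Set[str] = set()
    schedule: List[Tuple[str, int]] = []
    while pending:                                # step 1265: more jobs in the list?
        # Step 1245: find the first job with no unmet dependencies.
        ready = next((job for job in pending if job.depends_on <= completed), None)
        if ready is None:
            raise ValueError("remaining jobs have dependencies that cannot be met")
        if ready.preferred_core in cores:         # steps 1256 and 1257: honour affinity
            core = ready.preferred_core
        else:
            core = cores[0]                       # step 1258: any unallocated core
        schedule.append((ready.name, core))       # step 1260: message the core
        pending.remove(ready)
        completed.add(ready.name)                 # assumed to finish immediately
    return schedule
```

A real job manager would track which cores are busy and would dispatch independent jobs concurrently; the sequential model here is chosen only to keep the dependency and affinity handling easy to follow.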

Programs generated by the driver could anticipate the time taken to access main memory and therefore implement pre-fetching—that is, the fetching of data in advance of it being needed. This is often possible where machine learning is used due to the deterministic nature of machine learning operations and the associated deterministic memory accesses. In some examples, although generated programs may be provided to all shader cores 110, such prefetching instructions can be configured to only be executed by one of the shader cores 110. This is because all shader cores 110 can benefit from the availability of the data in the cache 187 even if the prefetch request only came from one shader core 110. The core that performs the prefetching could be a shader core executing a pilot shader program, which does not perform the computations but merely fetches the required data to cause the data to be transferred from main memory 160 to the cache 187. Prefetching can also be provided by a dedicated prefetch engine in a shader core 110 or associated with a cache such as the level two cache 187.

Concepts described herein may be embodied in computer-readable code for fabrication of an apparatus that embodies the described concepts. For example, the computer-readable code can be used at one or more stages of a semiconductor design and fabrication process, including an electronic design automation (EDA) stage, to fabricate an integrated circuit comprising the apparatus embodying the concepts. The above computer-readable code may additionally or alternatively enable the definition, modelling, simulation, verification and/or testing of an apparatus embodying the concepts described herein.

For example, the computer-readable code for fabrication of an apparatus embodying the concepts described herein can be embodied in code defining a hardware description language (HDL) representation of the concepts. For example, the code may define a register-transfer-level (RTL) abstraction of one or more logic circuits for defining an apparatus embodying the concepts. The code may define an HDL representation of the one or more logic circuits embodying the apparatus in Verilog, SystemVerilog, Chisel, or VHDL (Very High-Speed Integrated Circuit Hardware Description Language) as well as intermediate representations such as FIRRTL. Computer-readable code may provide definitions embodying the concept using system-level modelling languages such as SystemC and SystemVerilog or other behavioural representations of the concepts that can be interpreted by a computer to enable simulation, functional and/or formal verification, and testing of the concepts.

Additionally, or alternatively, the computer-readable code may define a low-level description of integrated circuit components that embody concepts described herein, such as one or more netlists or integrated circuit layout definitions, including representations such as GDSII. The one or more netlists or other computer-readable representation of integrated circuit components may be generated by applying one or more logic synthesis processes to an RTL representation to generate definitions for use in fabrication of an apparatus embodying the invention. Alternatively, or additionally, the one or more logic synthesis processes can generate from the computer-readable code a bitstream to be loaded into a field programmable gate array (FPGA) to configure the FPGA to embody the described concepts. The FPGA may be deployed for the purposes of verification and test of the concepts prior to fabrication in an integrated circuit or the FPGA may be deployed in a product directly.

The computer-readable code may comprise a mix of code representations for fabrication of an apparatus, for example including a mix of one or more of an RTL representation, a netlist representation, or another computer-readable definition to be used in a semiconductor design and fabrication process to fabricate an apparatus embodying the invention. Alternatively, or additionally, the concept may be defined in a combination of a computer-readable definition to be used in a semiconductor design and fabrication process to fabricate an apparatus and computer-readable code defining instructions which are to be executed by the defined apparatus once fabricated.

Such computer-readable code can be disposed in any known transitory computer-readable medium (such as wired or wireless transmission of code over a network) or non-transitory computer-readable medium such as semiconductor, magnetic disk, or optical disc. An integrated circuit fabricated using the computer-readable code may comprise components such as one or more of a central processing unit, graphics processing unit, neural processing unit, digital signal processor or other components that individually or collectively embody the concept.

In the present application, the words “configured to . . . ” are used to mean that an element of an apparatus has a configuration able to carry out the defined operation. In this context, a “configuration” means an arrangement or manner of interconnection of hardware or software. For example, the apparatus may have dedicated hardware which provides the defined operation, or a processor or other processing device may be programmed to perform the function. “Configured to” does not imply that the apparatus element needs to be changed in any way in order to provide the defined operation.

Although illustrative embodiments of the invention have been described in detail herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments, and that various changes, additions and modifications can be effected therein by one skilled in the art without departing from the scope and spirit of the invention as defined by the appended claims. For example, various combinations of the features of the dependent claims could be made with the features of the independent claims without departing from the scope of the present invention.

The technology described herein also extends to a computer software carrier comprising such software which, when used to operate a processor, renderer or microprocessor system comprising a data processor, causes, in conjunction with said data processor, said processor, renderer or microprocessor system to carry out the steps of the methods of the technology described herein. Such a computer software carrier could be a physical storage medium such as a ROM chip, CD ROM, RAM, flash memory, or disk, or could be a signal such as an electronic signal over wires, an optical signal or a radio signal such as to a satellite or the like.

The technology described herein may accordingly suitably be embodied as a computer program product for use with a computer system. Such an implementation may comprise a series of computer readable instructions fixed on a tangible, non-transitory medium, such as a computer readable medium, for example, a diskette, CD-ROM, ROM, RAM, flash memory, or hard disk. It could also comprise a series of computer readable instructions transmittable to a computer system, via a modem or other interface device, either over a tangible medium, including but not limited to optical or analogue communications lines, or intangibly using wireless techniques, including but not limited to microwave, infrared or other transmission techniques. The series of computer readable instructions embodies all or part of the functionality previously described herein.

Those skilled in the art will appreciate that such computer readable instructions can be written in a number of programming languages for use with many computer architectures or operating systems. Further, such instructions may be stored using any memory technology, present or future, including but not limited to, semiconductor, magnetic, or optical, or transmitted using any communications technology, present or future, including but not limited to optical, infrared, or microwave. It is contemplated that such a computer program product may be distributed as a removable medium with accompanying printed or electronic documentation, for example, shrink-wrapped software, pre-loaded with a computer system, for example, on a system ROM or fixed disk, or distributed from a server or electronic bulletin board over a network, for example, the Internet or World Wide Web.

The described embodiments were chosen in order to best explain the principles of the technology described herein and its practical applications, to thereby enable others skilled in the art to best utilise the technology described herein, in various embodiments and with various modifications as are suited to the particular use contemplated. It is intended that the scope be defined by the claims appended hereto. 

1. A processor comprising: a plurality of processor circuits to perform a machine learning process; and broadcast circuitry configured to broadcast data to at least a subset of the plurality of processor circuits; wherein the processor is configured to: obtain at least a first subset of machine learning data from memory to storage circuitry; and broadcast the at least a first subset of machine learning data from storage circuitry to at least a subset of the plurality of processor circuits.
 2. The apparatus according to claim 1, comprising: fetch circuitry, the fetch circuitry configured to fetch or stream machine learning data from memory to storage circuitry.
 3. The apparatus according to claim 1, wherein the broadcast circuitry is configured to broadcast first machine learning data from storage circuitry to local-storage circuitry in the at least a subset of the plurality of processor circuits.
 4. The apparatus according to claim 2, comprising: the fetch circuitry configured to fetch or stream first compressed machine learning data from memory; and decompression circuitry configured to decompress first compressed machine learning data to generate first decompressed machine learning data; and the fetch circuitry configured to fetch first decompressed machine learning data from decompression circuitry to storage circuitry.
 5. The apparatus according to claim 1, wherein the storage circuitry is a cache.
 6. The apparatus according to claim 1, wherein the storage circuitry is configured to store, in association with each entry, an indication of whether that entry is to be broadcast by the broadcast circuitry.
 7. The apparatus according to claim 1, wherein the broadcast circuitry is configured to broadcast the first machine learning data to at most a subset of the plurality of processor circuits; and the broadcast circuitry is configured to broadcast third machine learning data, different to the first machine learning data, to a further subset of the plurality of processor circuits; the subset of the plurality of processor circuits and the further subset of the plurality of processor circuits are mutually exclusive; and the first machine learning data and the third machine learning data relate to different layers of a neural network.
 8. The apparatus according to claim 1, wherein the processor is a tile-based graphics processor; and the processor circuits are shader cores; and the storage circuitry is a cache; and the local-storage circuitry is a tile buffer in a shader core, wherein the broadcast circuitry is configured to broadcast the first machine learning data from the cache to tile buffers in at least a subset of the plurality of shader cores.
 9. The apparatus according to claim 1, wherein the processor is further configured to: obtain at least a second subset of machine learning data from memory to storage circuitry; and transfer the at least a second subset of machine learning data from storage circuitry to at least a subset of the plurality of processor circuits; wherein the at least a second subset of machine learning data is different for each of the at least a subset of the processor circuits.
 10. The apparatus according to claim 9, comprising: dispatch circuitry to cause processor circuits to process machine learning data, wherein the dispatch circuitry is configured to cause each of the at least a subset of the processor circuits to process its first machine learning data with the second machine learning data.
 11. The apparatus according to claim 9, wherein either the first machine learning data is a kernel and the second machine learning data is a feature map, or the first machine learning data is the feature map and the second machine learning data is the kernel.
 12. The apparatus according to claim 9, wherein the apparatus is configured to operate in a kernel broadcast mode in which the first machine learning data is a kernel and the second machine learning data is a feature map; and the apparatus is configured to operate in a map broadcast mode in which the first machine learning data is the feature map and the second machine learning data is the kernel.
 13. The apparatus according to claim 12, wherein the apparatus is configured to dynamically change between the map broadcast mode and the kernel broadcast mode.
 14. The apparatus according to claim 12, wherein the apparatus is configured to dynamically change between the map broadcast mode and the kernel broadcast mode in dependence on a layer of neural network to which the kernel and the feature map relate.
 15. The apparatus according to claim 1, the storage circuitry comprising: snoop filter circuitry, the snoop filter circuitry configured to store, in association with each entry, a coherency state for coherent traffic and, for non-coherent traffic, an indication of whether that entry is to be broadcast by the broadcast circuitry.
 16. The apparatus according to claim 15, further comprising: snoop filter circuitry configured to store a snoop filter entry, the snoop filter entry to store at least one of a broadcast flag, a broadcast destination, or a broadcast address, for a non-coherent entry.
 17. The apparatus according to claim 1, comprising: a host processor configured to execute a driver; and job manager circuitry configured to dispatch tasks to at least a subset of the plurality of processor circuits, wherein the driver is configured to analyse layer processing of a neural network and to generate a job list to schedule processing of a neural network to at least a subset of the processing circuits; and the job manager circuitry is configured to process the job list generated by the driver, wherein the job manager circuitry is configured to determine available processing circuits of the plurality of processing circuits and, using the job list, to dispatch tasks to at least a subset of the plurality of processor circuits.
 18. The apparatus according to claim 17, wherein the driver is configured to select between map broadcast mode and kernel broadcast mode to minimise memory accesses to the at least subset of the plurality of processing circuits in dependence on: the size of the kernel and feature map associated with the layer; the available storage circuitry associated with each of the processing circuits; and the number of the at least a subset of the plurality of processing circuits processing the layer.
 19. A data processing method of transferring data to a plurality of processing circuits, the data processing method comprising: fetching at least a first subset of machine learning data from memory to storage; and broadcasting the at least a first subset of machine learning data from storage to at least a subset of the plurality of processor circuits.
 20. A non-transitory computer-readable medium to store computer-readable code for fabrication of an apparatus comprising: a plurality of processor circuits to perform a machine learning process; and broadcast circuitry configured to broadcast data to at least a subset of the plurality of processor circuits; wherein the processor is configured to: obtain at least a first subset of machine learning data from memory to storage circuitry; and broadcast the at least a first subset of machine learning data from storage circuitry to at least a subset of the plurality of processor circuits. 